An Architecture for Checkpointing and Migration of Distributed Components on the Grid
Authors
Sriram Krishnan
Abstract
A computational Grid is a set of hardware and software resources that provide seamless, dependable, and pervasive access to high-end computational capabilities. The Grid differs from other computational resources such as traditional supercomputers and clusters in three key features: (1) coordination of resources that are not subject to centralized control, (2) use of standard, open, general-purpose protocols and interfaces, and (3) delivery of non-trivial qualities of service despite unpredictable resource availability. The Open Grid Services Architecture (OGSA) is the first effort to standardize Grid functionality, based on concepts from the Web services community. However, the Web services based OGSA presents a server-centric approach that is not very conducive to the orchestration of complex distributed applications, where the interactions are not always of the client-server type. We present a distributed component based approach for composing complex applications on the Grid that conforms to the Common Component Architecture (CCA) while maintaining compatibility with Grid standards. Because Grid resources are geographically distributed and not subject to centralized control, their availability can be highly dynamic. Migration of individual components can be an effective strategy for dealing with dynamic resource availability. However, migrating components that are part of a distributed application is complicated by the possible interactions between them during execution. We present an approach for migrating distributed components in the presence of communication between them. Additionally, the reliability of Grid resources is very difficult to guarantee. Checkpointing applications and rolling back to a saved state is an effective form of fault tolerance for dealing with failures of such resources. However, because the applications are distributed, the checkpoints generated need to be globally consistent. We present our approach for checkpointing and restart of distributed components for fault tolerance purposes.
Similar resources
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...
Support for scheduling HLA-based simulations on the Grid
Developing distributed simulations that support runtime steering is an important issue. The High Level Architecture (HLA) [2] is a well known IEEE standard that fulfills many requirements of distributed interactive applications. HLA and the Grid complement each other to support distributed interactive simulations[6]. HLA is a good candidate for fulfilling the requirements of synchronization man...
Checkpointing and Migration of Communication Channels in Heterogeneous Grid Environments
A grid checkpointing service providing migration and transparent fault tolerance is important for distributed and parallel applications executed in heterogeneous grids. In this paper we address the challenges of checkpointing and migrating communication channels of grid applications executed on nodes equipped with different checkpointer packages. We present a solution that is transparent for th...
The Architecture of the XtreemOS Grid Checkpointing Service
The EU-funded XtreemOS project implements a grid operating system (OS) transparently exploiting distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) for implementing migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring applications in a distributed heterogeneous...
Independent checkpointing in a heterogeneous grid environment
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support different checkpointing protocols and to address the underlying gridnode checkpointers (e.g. BLCR, LinuxSSI, OpenVZ, etc...
Publication date: 2004